Despite its enormous progress in the last few decades, machine vision is still far from achieving the 
goal that human vision attains with such speed and reliability - in David Marr's words, to "know 
what is where by looking" (Marr, 1976). Recent results in the physiology and psychophysics of 
visual attention accentuate the gap between machines and humans, and provide a first step to 
understanding why it is so large and what machines must learn in order to overcome it. 

Paradoxically, what appear to be the simplest tasks for humans may be the most difficult for 
machines. Consider, for example, recognizing your mother in a sketch of her sitting in the kitchen. 
You could immediately and effortlessly locate her face, match it with your memory, and 
pronounce it a good or bad likeness. If the sketch were upside-down you could easily right it for a 
proper view. You would probably expend the most painstaking scrutiny in determining just which 
feature was slightly off, but even so, your final judgment would be quick. In contrast, a computer, 
using the most sophisticated face recognition routine, would perform the task slowly and 
incompletely, because it would not know where to start. Given the location of the two eyes in a 
sketch cluttered with dark round blobs, the routine could then search for the mouth, nose and chin 
at the appropriate distances and methodically match each feature to a virtually identical image in its 
memory. But failing to find the eyes, it could not go on to recognize the face. 
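
The dependence of such a routine on the eyes can be made concrete with a minimal sketch. It assumes that candidate features arrive as (x, y) blob centers in image coordinates (y increasing downward) and that the face template is a table of offsets measured in units of the inter-eye distance. The names `TEMPLATE` and `find_features`, the offsets, and the tolerance are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical template: expected feature offsets (along the eye axis,
# perpendicular to it), in units of the inter-eye distance, measured from
# the midpoint between the eyes. Illustrative values, not measurements.
TEMPLATE = {"nose": (0.0, 0.6), "mouth": (0.0, 1.1), "chin": (0.0, 1.7)}


def find_features(left_eye, right_eye, blobs, tolerance=0.2):
    """Search for nose, mouth and chin once the two eyes are known.

    Returns a dict mapping feature name to the matched blob, or None as
    soon as any expected feature has no blob close enough to it.
    """
    (lx, ly), (rx, ry) = left_eye, right_eye
    scale = math.hypot(rx - lx, ry - ly)  # inter-eye distance sets the scale
    if scale == 0:
        return None
    # Unit vector along the eye axis and its perpendicular (down the face).
    ax, ay = (rx - lx) / scale, (ry - ly) / scale
    px, py = -ay, ax
    mx, my = (lx + rx) / 2, (ly + ry) / 2
    found = {}
    for name, (u, v) in TEMPLATE.items():
        # Expected position of this feature in image coordinates.
        ex = mx + scale * (u * ax + v * px)
        ey = my + scale * (u * ay + v * py)
        best = min(blobs, key=lambda b: math.hypot(b[0] - ex, b[1] - ey),
                   default=None)
        if best is None or math.hypot(best[0] - ex, best[1] - ey) > tolerance * scale:
            return None  # a missing feature ends the match
        found[name] = best
    return found
```

Every step in the sketch is expressed relative to the eyes, so the routine has no starting point without them. That is the failure described above.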

The difficulty of the face recognition problem, and of object recognition more generally, has called 
into question one of the main assumptions underlying the construction of a machine that sees as 
humans do. The assumption holds that the goal of the first stages in vision is solely to determine 
"where" things are — that is, to transform the initial image, an array of intensity values, into a map 
of the scene which records the distance and orientation of each surface point relative to the viewer 
(the "2-1/2D sketch"). In machine vision the 2-1/2D sketch may serve to guide a mobile robot 
around an obstacle or to control its manipulations as it picks up a tool. But, like the raw image 
from which it is computed, the 2-1/2D sketch is itself simply a large array of numbers. Although 
it may contain preliminary information for object recognition, by assigning a color or texture to 
each surface point, it does not tell "what" things are. The critical task in object recognition is 
therefore to find the object or its crucial part within an array of intensity values or distances. Until 
now, many of the attempts to elucidate object recognition (reviewed by Besl and Jain, 1985 and 
by Harmon et al., 1979, for example) have assumed that the relevant object is already located and 
isolated in the image. 


Unlike machines, humans are adept at spotting the salient features of an object. To understand the 
mechanisms underlying this ability, psychophysicists have investigated visual attention. Treisman 
(1983) and Julesz (1984) have demonstrated that humans are extremely efficient in detecting a part 
of an image that differs in a single aspect from its background. For example, a red dot "pops 
out" in a field of yellow dots, and the same happens for a vertical line in a field of horizontal lines. 
The time required to detect the unusual item is independent of the number of other items, implying 
that the search for it occurs in parallel across the entire field. The human visual system obviously 
possesses a fast, parallel mechanism which can direct attention to salient chunks of the image. 
Although the possible computational purposes of this mechanism have not been probed by 
psychophysical experiments, its potential role in object recognition seems critical. For example, in 
face recognition the attention mechanism may perform two essential steps: first, to locate "blobs" 
which could be eyes; and second, to direct processing toward the blobs to verify that they are eyes 
and thereby to initiate recognition. The role of attention, therefore, may be not only to spotlight 
distinctive parts of the image but, more importantly, to segment the image into objects or parts of 
objects, a crucial first step in determining what things are. 

1 This mechanism is sometimes called "preattentive." Here, we consider it as part of the entire 
attention mechanism whose characteristics probably require more complex descriptions than 
"serial" or "parallel." 

Hurlbert & Poggio 
Visual Attention in Brains & Computers 

An important and still open question is: what are the features or primitives that drive attention? 
Likely candidates are separable features which, by definition, can be attended to selectively and 
are processed independently and in parallel. Pop-out and texture discrimination experiments 
provide a test for separable features and so far have diagnosed color, line orientation, line ends 
(terminators) and possibly crossings as candidates. 


Conjunction experiments test whether two or more separable features may combine to produce a 
higher-order primitive. For example, when a green T in a field of randomly mixed green Xs and 
brown Ts is the target, it does not pop out, and the time required to detect it increases linearly with 
the number of background items. Thus the detection of a particular conjunction of color and shape 
appears to require a search over each item in turn, across the entire field. Conjunction experiments 
thus reveal another aspect of the attention mechanism, a serial searchlight which appears to 
operate independently of eye movements and does for feature conjunctions what the parallel 
mechanism does for features. 
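
The contrast between the two search modes can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of the experiments: each item is a small dictionary of feature values, the parallel mechanism is idealized as a single step that finds a value unique within one feature map, and the serial searchlight visits items one at a time in random order. The function names and the step-counting convention are hypothetical.

```python
import random
from collections import Counter


def popout_search(items, feature):
    """Parallel search on one feature map.

    Returns (index, steps), where index is the first item whose value on
    `feature` is unique in the field, or None if there is none. All
    locations are treated as examined at once, so steps is always 1.
    """
    counts = Counter(item[feature] for item in items)
    for i, item in enumerate(items):
        if counts[item[feature]] == 1:
            return i, 1
    return None, 1


def serial_search(items, target, rng=random):
    """Serial searchlight: inspect items in random order until a match."""
    order = list(range(len(items)))
    rng.shuffle(order)
    for steps, i in enumerate(order, start=1):
        if items[i] == target:
            return i, steps
    return None, len(items)


def conjunction_field(n_distractors, rng=random):
    """One green T among green Xs and brown Ts (use at least 4 distractors
    so that no single color or shape value is unique in the field)."""
    field = [{"color": "green", "shape": "X"} if k % 2 == 0
             else {"color": "brown", "shape": "T"}
             for k in range(n_distractors)]
    field.insert(rng.randrange(n_distractors + 1),
                 {"color": "green", "shape": "T"})
    return field


def mean_serial_steps(n_distractors, trials=500, seed=0):
    """Average number of items inspected before the green T is found."""
    rng = random.Random(seed)
    target = {"color": "green", "shape": "T"}
    total = 0
    for _ in range(trials):
        _, steps = serial_search(conjunction_field(n_distractors, rng), target, rng)
        total += steps
    return total / trials
```

In this sketch a red dot among yellow dots is found in one step whatever the field size. The green T is unique in neither the color map nor the shape map, so `popout_search` fails on both. The serial search then inspects on average about half of the items, so its cost grows linearly with the number of background items.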


Until recently, all conjunctions between known separable features had been shown to require the 
serial searchlight. The recent results of Nakayama and Silverman (1986) reveal a surprising 
exception to this pattern. In pop-out experiments using fields of small rectangular patterns 
displayed on a color television monitor, the authors demonstrate that binocular disparity and 
motion individually behave as separable features. The conjunction of motion and color does not; 
the search for a pattern of blue upward-moving dots is slow and serial across a field of 
blue-downward patterns and red-upward patterns. Contrary to this trend, conjunctions of 
binocular disparity and either color or motion behave as separable features: they are searched for in 
parallel. (The authors report that when the field splits into two planes, one in front of the other, the 
search for a conjunction amounts to a pop-out of the unusual item in one plane. Thus we suggest 
that it may be possible, using a different kind of motion stimulus, to create separate planes of 
coherent motion and thereby induce a parallel search for motion-color conjunctions.) 
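
The plane-splitting account in the parenthetical remark can be sketched in the same toy terms. The assumption here, which is ours and not a claim about the authors' stimuli, is that the field is first segregated by one feature (disparity) and that the pop-out test is then applied within each resulting plane. The function names and the `split_by` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def unique_items(items, indices, feature):
    """Indices (within `indices`) whose value on `feature` occurs only once."""
    counts = Counter(items[i][feature] for i in indices)
    return [i for i in indices if counts[items[i][feature]] == 1]


def popout(items, feature, split_by=None):
    """Pop-out test on `feature`, optionally applied within each group of
    items that share the same value of `split_by` (e.g. a disparity plane)."""
    if split_by is None:
        groups = {None: list(range(len(items)))}
    else:
        groups = defaultdict(list)
        for i, item in enumerate(items):
            groups[item[split_by]].append(i)
    hits = []
    for indices in groups.values():
        if len(indices) > 1:  # a lone item has no background to pop out from
            hits.extend(unique_items(items, indices, feature))
    return sorted(hits)
```

For a front-red target among front-blue and back-red items, color is not unique over the whole field but is unique within the front plane, so the conjunction reduces to a single-feature pop-out. A motion-color conjunction presented in a single plane gains nothing from the split, which is consistent with the serial search reported for that case.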


The psychophysical studies on separable features coincide with the recent emphasis on functional 
localisation in visual neurophysiology and neuroanatomy. It is tempting to draw an explicit 
connection between biology and psychophysics by equating different visual cortical areas with 
different feature maps; for example, calling V4 the color map, MT the motion map, and V1 the 
orientation map. The psychophysics would suggest that in a given feature map, at each spatial 
location there exists a collection of neurons each tuned to a different value of the feature (e.g. red, 
green or blue for color). Although such an organization has not been demonstrated, the evidence 
for segregation of functionally similar neurons in distinct cortical areas is steadily accumulating. 
From this point of view, the results of Nakayama and Silverman have interesting implications for 
neurons and feature maps: they preclude the existence of neurons tuned for both motion and color, 
and predict the existence of neurons tuned to a particular combination of binocular disparity and 
motion and of neurons tuned to disparity and color. The results also suggest that feature maps may 
be replicated at each of several disparity planes. The prediction of disparity-motion tuned neurons 
is supported by Maunsell and Van Essen's (1983) report of similar neurons in cortical area MT. 


Other recent work in visual neurophysiology puts the emphasis on a different aspect of attention. 
Rather than address computational questions such as, "what are the salient features?" and "how 
does the attention mechanism work?" or the psychophysical question "is feature processing parallel 
or serial?", the new class of physiology experiments on alert animals seeks to demonstrate the ways 
in which attention can modulate neuronal responses. In the course of such experiments, insights 
into the neural circuitry and anatomical location of the attention mechanism have emerged. For 
example, based on studies of attention-mediated modulation in the inferior parietal lobe (area 7), 
Lynch et al. (1977) have proposed that neurons there are responsible for directing attention to 
visual targets. 

More recent research has demonstrated the effects of attention at other levels in the visual pathway. 
Moran and Desimone (1985) have shown that, in the monkey, the response of a neuron in 
V4 or IT to a preferred stimulus (for example, a red horizontal bar) is dramatically reduced when 
the animal ignores it and instead attends to an ineffective stimulus (such as a green vertical bar) 
within the same receptive field (which, for IT neurons, may extend at least 12°). The response of 
the neuron to the preferred stimulus is unaffected when the attended stimulus is outside its receptive 
field. Thus V4 and IT neurons are able to filter out an irrelevant stimulus when it competes with a 
relevant stimulus within the same receptive field. V1 neurons do not have this property, and the 
monkey cannot even perform the differential attention task when the two stimuli are close enough 
to fit within a single receptive field in V1. 
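
The filtering behaviour described for V4 and IT neurons can be summarized in a toy gating rule. This is a sketch only. Stimulus positions are one-dimensional, the receptive field is an interval, the neuron's output is the largest gated drive among the stimuli inside the field, and unattended stimuli are attenuated by a fixed factor only when the attended stimulus also lies inside the field. The function name, the `suppression` value, and the drive values are hypothetical and are not taken from the recordings.

```python
def gated_response(stimuli, rf_center, rf_radius, attended, drive, suppression=0.3):
    """Toy response of one neuron with attention-dependent gating.

    stimuli:     dict mapping stimulus name to position (degrees, 1-D)
    attended:    name of the stimulus the animal attends to
    drive:       dict mapping stimulus name to the response it evokes alone
    suppression: assumed gain applied to unattended stimuli that share the
                 receptive field with the attended one
    """
    def in_field(name):
        return abs(stimuli[name] - rf_center) <= rf_radius

    gate_active = in_field(attended)  # attention acts only inside the field
    response = 0.0
    for name in stimuli:
        if not in_field(name):
            continue  # stimuli outside the field do not drive the neuron
        gain = suppression if (gate_active and name != attended) else 1.0
        response = max(response, gain * drive[name])
    return response
```

With a preferred and an ineffective stimulus in the same field, attending to the ineffective one reduces the output to the suppressed drive of the preferred stimulus. Attending to a stimulus outside the field leaves the response to the preferred stimulus unchanged.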


A recent psychophysical experiment in humans by Sagi and Julesz (1986) provides an intriguing 
complement to these physiological results. Sagi and Julesz find that visual attention directed to a 
random location for an orientation discrimination task enhances the detection of a test flash 
presented simultaneously within a certain radius of the target. The area of enhancement, which the 
authors conjecture to be the area covered by the searchlight of attention, varies from 1.5° at 2° 
eccentricity to about 3° at 4° eccentricity. Interestingly, these areas are likely to be larger than the 
average receptive field sizes in V1. 

The above results imply that attention to one region of an image may involve both suppression of 
visual processing in irrelevant regions and enhancement of visual processing in relevant regions. 
Thus attention may indeed be responsible for directing a processing focus to specific locations in 
the initial steps of recognition. Yet although biological research may have found the key to machine 
vision, it has yet to describe how it opens the lock. 

Computational results suggest that the attention mechanism may be even more complex and 
powerful than experiments have revealed. Consider again the face recognition problem. Individual 
features such as eyes or the curved line of nose and mouth can by themselves lead to the hypothesis 
of a face (see figure 2a). In contrast, as figure 2b shows, features alone cannot be the only cue for 
recognition. The spatial relationship between the two eye tokens and the closed outer contour can 
also cue the face recognition process. Ullman (1984) has argued cogently that spatial relations 
must be computed by a mechanism similar to the serial searchlight of attention. 

The unraveling of the full complexity of visual attention will clearly involve computational, 
psychophysical, and physiological research and in turn will influence not only our understanding of 
visual perception but also the architecture and the control structure of machine vision systems. 


Figure 1. A patch of horizontal line segments "pops out" in a field of vertical line segments. 



Figure 2. In (a) each separate set of "face" features is sufficient to suggest the hypothesis of a face. 
In (b) it is the spatial relation between features, and not the features themselves, that cues the 
face hypothesis. 